In this exploratory notebook, I'll apply principal component analysis (PCA) and Gaussian mixture modelling (GMM) to the SO-CHIC data suite prepared by Shenjie Zhou (BAS). The result is a set of reasonably distinct clusters, both in PC space and in lat-lon space. The i-metric highlights the boundaries between the classes.

Here we are exploring larger values of K, the number of classes.

Import modules

Start Dask client

Subset parameters

These values indicate which lat-lon-depth range will be used in the clustering analysis. The first parameter, called "subset", is an arbitrary selection of profiles for quick plotting purposes. That parameter does not affect how many profiles are used in the clustering.
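As a minimal sketch of how such parameters might be applied, assuming the profile positions and depth levels are available as NumPy arrays: the names `lon_range`, `lat_range`, `depth_range` and `plot_stride`, and all numerical values, are hypothetical and are not the ones used in this notebook. The box selection decides which profiles and levels enter the clustering; the strided subset is only for quick plotting.

```python
import numpy as np

# Synthetic profile positions and depth levels (placeholders for the real data).
rng = np.random.default_rng(0)
n_profiles = 1000
lon = rng.uniform(-180.0, 180.0, n_profiles)
lat = rng.uniform(-85.0, -30.0, n_profiles)
depth = np.arange(0.0, 2000.0, 50.0)

# Hypothetical selection ranges; the values are illustrative only.
lon_range = (-65.0, 80.0)
lat_range = (-80.0, -45.0)
depth_range = (100.0, 1000.0)

# Profiles inside the lat-lon box are the ones passed to the clustering.
in_box = (
    (lon >= lon_range[0]) & (lon <= lon_range[1])
    & (lat >= lat_range[0]) & (lat <= lat_range[1])
)
cluster_idx = np.flatnonzero(in_box)

# Depth levels retained for each profile.
level_mask = (depth >= depth_range[0]) & (depth <= depth_range[1])

# Arbitrary strided subset, used only for plotting.
plot_stride = 10
plot_idx = cluster_idx[::plot_stride]

print(f"{cluster_idx.size} profiles for clustering, {plot_idx.size} for plotting")
print(f"{int(level_mask.sum())} depth levels retained")
```

Changing `plot_stride` alters only how many profiles are drawn; `cluster_idx` is unaffected.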

Import data

Shenjie Zhou (BAS) prepared these profiles using MITprof software, which produces a set of profiles that have been interpolated onto a specified set of depth levels. They are all in NetCDF format with standard names. I've decided to use xarray below for ease of use.

Time and date handling

Plot subset of temperature profiles

This is an arbitrary subset of the profiles, just for data visualization purposes.

Plot subset of salinity profiles

Plot histograms to get a sense of the data distribution

First, plot the potential temperature histogram

Next, plot the salinity histogram

Clustering with GMM

Preprocessing/scaling and dimensionality reduction via PCA

Calculate BIC and AIC to inform selection of the number of clusters

Plot the BIC scores

As in many other oceanographic applications, the BIC curve has no clear minimum; instead, it flattens. We can therefore opt for a smaller number of classes for ease of interpretation. K = 5 does not quite reach the minimum BIC, but beyond 5 the decrease in BIC is much more gradual, so it is still a reasonable choice.
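The scale, reduce, and score sequence can be sketched as follows. This is an illustration on synthetic data, assuming scikit-learn's `StandardScaler`, `PCA` and `GaussianMixture`; the array `X`, the number of retained components, and the range of K are placeholders rather than the settings used in this notebook.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic "profiles": 3 groups of 200 profiles on 20 depth levels.
rng = np.random.default_rng(42)
n_levels = 20
centers = rng.normal(0.0, 5.0, size=(3, n_levels))
X = np.vstack([c + rng.normal(0.0, 1.0, size=(200, n_levels)) for c in centers])

# Standardise each depth level, then reduce to a few principal components.
X_scaled = StandardScaler().fit_transform(X)
X_pca = PCA(n_components=3).fit_transform(X_scaled)

# Fit one GMM per candidate K and record both information criteria.
k_values = list(range(1, 9))
bic_scores, aic_scores = [], []
for k in k_values:
    gmm = GaussianMixture(n_components=k, covariance_type="full", random_state=0)
    gmm.fit(X_pca)
    bic_scores.append(gmm.bic(X_pca))
    aic_scores.append(gmm.aic(X_pca))

# The decrease in BIC from K-1 to K shows where the curve starts to flatten.
bic_drop = -np.diff(bic_scores)
for k, drop in zip(k_values[1:], bic_drop):
    print(f"K={k}: BIC decreased by {drop:.1f}")
```

Inspecting the successive decreases, rather than hunting for a strict minimum, is one way to make the "flattening" argument explicit.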

Plot the AIC scores

Select the actual GMM to be used in the analysis

Visualise clustering in PC space

The classes look reasonably distinct in PC space. There are some subtle oddities that might be better captured by a different clustering method; we'll explore this later.

Calculate class means and standard deviations
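A minimal sketch of this step, assuming the profiles are held in a 2-D array of shape (profile, depth level) alongside a 1-D array of class labels. The names `temperature`, `labels`, `class_mean` and `class_std` are hypothetical, and the data are synthetic.

```python
import numpy as np

# Synthetic temperature profiles and class labels (placeholders).
rng = np.random.default_rng(1)
n_profiles, n_levels, n_classes = 300, 15, 3
temperature = rng.normal(2.0, 1.0, size=(n_profiles, n_levels))
labels = rng.integers(0, n_classes, size=n_profiles)

# One mean profile and one standard-deviation profile per class.
class_mean = np.full((n_classes, n_levels), np.nan)
class_std = np.full((n_classes, n_levels), np.nan)
for k in range(n_classes):
    members = temperature[labels == k]
    if members.shape[0] > 0:
        # nan-aware reductions tolerate missing values at individual levels.
        class_mean[k] = np.nanmean(members, axis=0)
        class_std[k] = np.nanstd(members, axis=0)

print(class_mean.shape, class_std.shape)
```

The same loop applies unchanged to salinity; plotting `class_mean[k]` with a band of plus or minus `class_std[k]` against depth gives the vertical structure of each class.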

Plot vertical T and S structure of the classes

Label map of the classes

Note the emergence of the near-Antarctic class. One idea would be to take only those profiles and re-apply a clustering algorithm to them in order to capture finer-scale structure.
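That idea could look roughly like the sketch below, which assumes scikit-learn's `GaussianMixture` and uses synthetic two-dimensional data in place of the real principal components. The choice of `target_class` and the number of sub-classes are placeholders; nothing here has been applied to the SO-CHIC profiles.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic PC-space data: one broad group plus a second group
# that itself contains two sub-groups.
rng = np.random.default_rng(7)
group_a = rng.normal([0.0, 0.0], 0.5, size=(300, 2))
group_b1 = rng.normal([6.0, 0.0], 0.3, size=(150, 2))
group_b2 = rng.normal([6.0, 2.0], 0.3, size=(150, 2))
X_pca = np.vstack([group_a, group_b1, group_b2])

# First pass: coarse classes.
gmm = GaussianMixture(n_components=2, random_state=0).fit(X_pca)
labels = gmm.predict(X_pca)

# Pick the class of interest (here, whichever class the last profile fell into;
# in practice this would be the near-Antarctic class).
target_class = labels[-1]
members = np.flatnonzero(labels == target_class)

# Second pass: re-cluster only the profiles in that class.
sub_gmm = GaussianMixture(n_components=2, random_state=0).fit(X_pca[members])
sub_labels = sub_gmm.predict(X_pca[members])

print(f"{members.size} profiles re-clustered into {np.unique(sub_labels).size} sub-classes")
```

One design choice left open here is whether to refit the scaling and PCA on the subset before the second pass; doing so would let the principal components adapt to the variability within that class alone.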

Calculate the i-metric

https://os.copernicus.org/preprints/os-2021-40/
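A minimal sketch of the calculation, assuming the i-metric for a profile is one minus the difference between its largest and second-largest posterior class probabilities. That is my reading of the linked preprint and should be checked against it. The function name `i_metric` is hypothetical; with scikit-learn, the posterior matrix would typically come from `GaussianMixture.predict_proba`.

```python
import numpy as np

def i_metric(posteriors):
    """Return 1 - (p_max - p_second) for each row of a posterior matrix.

    posteriors: array of shape (n_profiles, n_classes), each row summing to 1.
    Assumed definition; values near 0 mean a confident assignment,
    values near 1 mean the top two classes are almost equally likely.
    """
    p = np.asarray(posteriors, dtype=float)
    # Sort each row ascending; the last two columns are the top two probabilities.
    top_two = np.sort(p, axis=1)[:, -2:]
    return 1.0 - (top_two[:, 1] - top_two[:, 0])

# Three illustrative profiles: certain, fully ambiguous, and in between.
example = np.array([
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.6, 0.3, 0.1],
])
print(i_metric(example))
```

Because the function is vectorised over rows, it can be applied to all profiles at once rather than in an explicit loop.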

Loop through profiles (this may take a few minutes)

Plot i-metric map

Darker shading indicates that the profile is less likely to be near a boundary between classes. Lighter shading indicates that the profile is more likely to be near a boundary between classes.

Plot i-metric by class

This view makes it easier to see that the classes are reasonably spatially distinct, with "fuzzy" boundaries in the expected locations.

Save this dataset for further processing